Last year, the number of software developers worldwide was estimated at 26.4 million. The field thrives on open-source tools and technologies, with new ones created every day. Given the sheer number of tools available, millions of unique combinations of these tools and technologies likely exist among software developers.

This project applies Data Cleaning, Exploratory Data Analysis, and Machine Learning techniques to better understand the connection between a developer's tools and their profile type.


The Presentation

Overview of the Data

  • Top Skills
  • Top Profiles

Defining our Dataset

  • Database
  • Machine Learning (ML)
  • Quality Assurance (QA)
  • Frontend
  • Fullstack


Transforming the Data

Reducing Noise with Feature Selection

The Machine Learning Models

Logistic Regression

Class Balance
Benchmark Model
Scoring with the ROC Curve
Class Imbalance with PySpark
Model Performance Metrics

Binary Decision Tree

Benchmark Model
Model Performance with the Classification Report
Scoring with ROC_AUC
Visualizing the Decision Tree
Model Performance Metrics

Multinomial (Multi-class) Random Forest

Benchmark Model
Model Performance with the Classification Report

The Data

In [8]:
data_overview = pd.read_csv("../Celerative_Working/Data_Cleaning/data_overview.csv")
data_overview.drop(["Unnamed: 0"], axis=1, inplace=True)
data_overview.head(5)
Out[8]:
name profile skills
0 Hugo L. Samayoa DevOps Developer ['Agile Software Development', 'Kubernetes', '...
1 Stepan Yakovenko Software Developer ['Debian Linux', 'MySQL', 'HTML5', 'GWT', 'Aja...
2 Slobodan Gajic Software Developer ['Firebase', 'JSON', 'CSS', 'Angular', 'jQuery...
3 Bruno Furtado Montes Oliveira Visual Studio Team Services (VSTS) Developer ['Agile', 'Debian Linux', 'PostgreSQL', 'Pytho...
4 Jennifer Aquino Query Optimization Developer ['Automation', 'SQL Server Integration Service...

How an Algorithm Sees the Data

Observations or Samples Classes, Predicted vs. Actual (y) Features (X)
observation #1 profile_A skill_1, skill_5, skill_3...
observation #2 profile_B skill_5, skill_2, skill_4...
observation #3 profile_C skill_4, skill_2, skill_1...
observation #4 profile_B skill_4, skill_1, skill_5...
observation #5 profile_A skill_3, skill_4, skill_5...

Data Cleaning

Noise is data which does not add meaningful information for the algorithm.


Controlling for noise yields more reliable predictive analysis when applying Machine Learning algorithms.

This is an example of noisy data. Left unchecked, it could cause the algorithm to return results which don't accurately reflect the true data:

Similar Strings
'apache' 'apache spark'
'apache ant' 'apache tomcat'
'rest' 'rest apis'
'android studio' 'android' 'android sdk'
'asp.net mvc' 'asp.net' 'asp'
'visual studio code' 'microsoft visual studio'
'angular js' 'angular'
'c#' 'c' 'c++'
'asp.net mvc' 'asp.net'
'google app engine' 'google cloud' 'google compute engine'
'google analytics' 'google adsense' 'google adwords',
'phpstorm' 'php'
'net' 'net framework'
'aws ec2' 'aws s3'
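One common first pass at taming such near-duplicate strings is fuzzy string matching. Below is a minimal sketch using Python's standard-library difflib; the 0.6 threshold and the tiny skill list are illustrative choices, not the project's actual cleaning pipeline:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] measuring how alike two skill strings are."""
    return SequenceMatcher(None, a, b).ratio()

skills = ["apache", "apache spark", "angular js", "angular"]

# Flag pairs whose similarity exceeds a chosen threshold for manual review
threshold = 0.6
pairs = [
    (a, b)
    for i, a in enumerate(skills)
    for b in skills[i + 1:]
    if similarity(a, b) >= threshold
]
print(pairs)
```

Flagged pairs such as ('angular js', 'angular') can then be merged into a single canonical skill name, while true distinctions (e.g. 'c' vs. 'c#') still require a human decision.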

Before

In [9]:
print("Before the data was cleaned and processed, the dataset consisted of:\n")
print(
    "\t - ",
    data_overview.profile.nunique(),
    'unique profile types or "classes" \n',
)
print("\t & ")
print(
    "\n\t - ",
    data_overview.skills.explode().nunique(),
    'skills or "features"',
)
Before the data was cleaned and processed, the dataset consisted of:

	 -  2672 unique profile types or "classes" 

	 & 

	 -  5496 skills or "features"

After

In [12]:
print(
    "Using Natural Language Processing and Text Pre-Processing techniques provided via the Scikit-Learn library, skills & profile strings were standardized to:"
)
print("\t - ", clean_talentpool.profile.nunique(), "profiles (classes)")
print("\t & ")
print("\t - ", clean_talentpool.skills.explode().nunique(), "skills (features)")
Using Natural Language Processing and Text Pre-Processing techniques provided via the Scikit-Learn library, skills & profile strings were standardized to:
	 -  415 profiles (classes)
	 & 
	 -  780 skills (features)

Top Skills

In [14]:
# Generating wordcloud
wordcloud = WordCloud(
    width=3000,
    height=2000,
    random_state=1,
    max_words=250,
    background_color="black",
    colormap="Blues",
    collocations=False,
    stopwords=STOPWORDS,
).generate(
    series_to_str(clean_talentpool["skills"])
)  # Plot
plot_cloud(wordcloud)

Top Profiles

In [16]:
pd.DataFrame(clean_talentpool["profile"].value_counts())[1:19]
Out[16]:
profile
quality assurance 79
front-end developer 69
javascript developer 64
ios developer 55
full-stack developer 54
frontend php javascript 51
machine learning engineer 50
python developer 48
php developer 47
java developer 41
android developer 31
javascript frontend design 28
c# developer 24
ruby on rails developer 22
architecture developer 22
software architect 21
design 20
windows presentation foundation (wpf) developer 19

It's clear that a few of the top profile types are described only by their key skills, such as "frontend php javascript", "javascript frontend design", etc. These top keywords may reveal more about the data's profiles when isolated from non-descriptive words like "developer" and "engineer".

In [18]:
# Drop empty row & subsetting for top 25 keywords
dotplot.drop(14, inplace=True)
dotplot.reset_index(drop=True)[:25]
Out[18]:
profile_keywords count
0 javascript 411
1 frontend 339
2 php 164
3 design 157
4 fullstack 150
5 ios 119
6 senior 118
7 python 102
8 java 97
9 ruby 91
10 android 90
11 web 87
12 qualityassurance 79
13 data 77
14 net 73
15 machinelearning 60
16 devops 56
17 architecture 53
18 c 43
19 rails 37
20 middle 36
21 backend 30
22 testing 30
23 wordpress 30
24 system 27
In [19]:
# Draw plot
y = dotplot.profile_keywords[:45]

fig, ax = plt.subplots(figsize=(10, 12), dpi=80)
ax.hlines(
    y=y,
    xmin=10,
    xmax=205,
    color="gray",
    alpha=0.7,
    linewidth=1,
    linestyles="dashdot",
)
ax.scatter(
    y=y,
    x=dotplot.loc[:45, ["count"]],
    s=75,
    color="firebrick",
    alpha=0.7,
)

# Title, Label, Ticks and Ylim
ax.set_title("Most Common Profile Keywords", fontdict={"size": "22"})
ax.set_xlabel("Count of Keywords", fontdict={"size": "22"})
ax.set_yticklabels(y, fontdict={"horizontalalignment": "right", "size": "14"})
ax.set_xlim(5, 200)
plt.show()

Defining our Dataset

For the purposes of this presentation, we will narrow our focus to look at machine learning model performance using a dataset made up of five (5) profiles:

  1. "Database"
  2. "ML", combining profiles containing the phrase "machine learning"
  3. "QA", combining profiles containing the phrase "quality assurance"
  4. "Fullstack"
  5. "Frontend"

Makeup of the Dataset

In [22]:
# Get percentage of developer profiles
(five_profiles.profile.value_counts(normalize=True) * 100).astype(int)
Out[22]:
Frontend     48
Fullstack    21
QA           11
Database      9
ML            8
Name: profile, dtype: int32
In [24]:
print(
    "Although profile types have been cut to 5 broad profile categories, \
    \nthere are still",
    ex_skills.skills.nunique(),
    "skills (vs. the 812 seen with all profiles). \
    \nSo although many of the classes have been cut, most skills remain,\
    \nqualifying the exploration of this particular subset of the data as valid.",
)
Although profile types have been cut to 5 broad profile categories,     
there are still 527 skills (vs. the 812 seen with all profiles).     
So although many of the classes have been cut, most skills remain,    
qualifying the exploration of this particular subset of the data as valid.

Transforming the Data

Our Data's Format

Observations/Samples Classes (y) Features (X)
observation #1 profile_A skill_1, skill_5, skill_3...
observation #2 profile_B skill_5, skill_2, skill_4...
observation #3 profile_C skill_4, skill_2, skill_1...
observation #4 profile_B skill_4, skill_1, skill_5...
observation #5 profile_A skill_3, skill_4, skill_5...

Machine Learning Data Format

Observations/Samples Classes (y) Feature1 (X) Feature2 (X) Feature3 (X) Feature4 (X)...
observation #1 profile_A skill_1 skill_2 skill_3 skill_4 ...
observation #2 profile_B skill_1 skill_2 skill_3 skill_4 ...
observation #3 profile_C skill_1 skill_2 skill_3 skill_4 ...
observation #4 profile_B skill_1 skill_2 skill_3 skill_4 ...
observation #5 profile_A skill_1 skill_2 skill_3 skill_4 ...

The Scikit-Learn MultiLabelBinarizer() Object

sklearn's MultiLabelBinarizer() object takes each skill and expands it so that each skill (feature) gets its own column. If a sample (developer) has the specified skill, the corresponding row is marked with a 1; otherwise, it's marked with a 0.

This effectively converts all skills to binary variables, with the features (skills) as columns and the observations as rows. Machine Learning algorithms need data formatted this way in order to process it correctly.
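A minimal sketch of that transformation on toy skill lists (the `binarized()` helper used below wraps a similar call):

```python
from sklearn.preprocessing import MultiLabelBinarizer
import pandas as pd

# Toy stand-in for each developer's list of skills
skills = [
    ["python", "sql"],
    ["javascript", "css"],
    ["python", "javascript"],
]

# One column per unique skill; 1 marks that the developer has it
mlb = MultiLabelBinarizer()
binarized = pd.DataFrame(mlb.fit_transform(skills), columns=mlb.classes_)
print(binarized)
```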

The Result

In [27]:
df = binarized(ex_skills)
df.iloc[:, 1:8]
Out[27]:
profile skills .net .net 4 .net core .net framework ad planningand buying
0 Fullstack [rest, visual studio code, json, full-stack, l... 0 0 0 0 0
1 Frontend [html, ux design, javascript, type script, red... 0 0 0 0 0
2 ML [data science, linux, apache hive, python 3, m... 0 0 0 0 0
3 ML [data science, flask, mysql, python, linux rhe... 0 0 0 0 0
4 QA [javascript, ruby on rails (ror), jenkins] 0 0 0 0 0
... ... ... ... ... ... ... ...
628 Database [data science, postgresql, r, aws lambda, opti... 0 0 0 0 0
629 Frontend [html, javascript, php, laravel, wordpress, ph... 0 0 0 0 0
630 Frontend [html, java, spring, javascript, node, js, dat... 0 0 0 0 0
631 Frontend [html, automated testing, phantom js, wirefram... 0 0 0 0 0
632 ML [test-driven development (tdd), amazon web ser... 0 0 0 0 0

633 rows × 7 columns

X

is recognized as the variable which stores the "input variables", or features on which the algorithm is trained.
Here features are assigned to X:

In [28]:
# Assign features to X
X = df.iloc[:, 5:]
X
Out[28]:
.net core .net framework ad planningand buying ado.net adobe illustrator adobe photoshop affiliate marketing afnetworking agile ajax ... writing wwf x xamarin xcode xml yii zend framework zeplin zurb foundation
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
628 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
629 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
630 0 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
631 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
632 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

633 rows × 525 columns

y

is recognized as the variable which stores class labels in classification problems.
Here class labels are assigned to y:

In [29]:
# Assign target class to y
y = df.profile
y
Out[29]:
0      Fullstack
1       Frontend
2             ML
3             ML
4             QA
         ...    
628     Database
629     Frontend
630     Frontend
631     Frontend
632           ML
Name: profile, Length: 633, dtype: object

Now that the data has been formatted to be understandable to the algorithm, inconsistencies or noise in the data should be addressed.

Feature Engineering or Feature Selection

involves choosing features of the data that are more relevant to the model performing well.
This makes data input to the model less noisy, lets the model train faster, simplifies the model for interpretation, and improves generalization to outside data.

- Recursive Feature Elimination (RFE)

- Correlation

Correlation

In [30]:
# Create the correlation matrix
corr_matrix = X.corr()
# Subset the matrix
corr_matrix.iloc[:11, :11]
Out[30]:
.net core .net framework ad planningand buying ado.net adobe illustrator adobe photoshop affiliate marketing afnetworking agile ajax akamai
.net core 1.000000 -0.001582 -0.007897 -0.001582 -0.001582 -0.004500 -0.002745 -0.001582 -0.029539 -0.009329 -0.003172
.net framework -0.001582 1.000000 -0.007897 -0.001582 -0.001582 -0.004500 -0.002745 -0.001582 -0.029539 0.169613 -0.003172
ad planningand buying -0.007897 -0.007897 1.000000 -0.007897 -0.007897 -0.022460 0.106738 -0.007897 0.198199 -0.046556 0.088553
ado.net -0.001582 -0.001582 -0.007897 1.000000 -0.001582 -0.004500 -0.002745 -0.001582 -0.029539 -0.009329 -0.003172
adobe illustrator -0.001582 -0.001582 -0.007897 -0.001582 1.000000 -0.004500 -0.002745 -0.001582 -0.029539 -0.009329 -0.003172
adobe photoshop -0.004500 -0.004500 -0.022460 -0.004500 -0.004500 1.000000 -0.007807 -0.004500 -0.054471 -0.026533 -0.009022
affiliate marketing -0.002745 -0.002745 0.106738 -0.002745 -0.002745 -0.007807 1.000000 -0.002745 0.044868 -0.016183 -0.005503
afnetworking -0.001582 -0.001582 -0.007897 -0.001582 -0.001582 -0.004500 -0.002745 1.000000 -0.029539 -0.009329 -0.003172
agile -0.029539 -0.029539 0.198199 -0.029539 -0.029539 -0.054471 0.044868 -0.029539 1.000000 -0.144463 0.065734
ajax -0.009329 0.169613 -0.046556 -0.009329 -0.009329 -0.026533 -0.016183 -0.009329 -0.144463 1.000000 -0.018702
akamai -0.003172 -0.003172 0.088553 -0.003172 -0.003172 -0.009022 -0.005503 -0.003172 0.065734 -0.018702 1.000000
In [31]:
# Flag for removal any column correlated at 50% or more with an earlier kept column
columns = np.full((corr_matrix.shape[0],), True, dtype=bool)
for i in range(corr_matrix.shape[0]):
    for j in range(i + 1, corr_matrix.shape[0]):
        if corr_matrix.iloc[i, j] >= 0.5:
            columns[j] = False

By keeping only features that are less than 50% correlated with one another, the feature count has been reduced from 526 to 278, effectively reducing the "dimensionality" (the number of column vectors) of the dataset. This is one of an array of techniques used to select the most important features of a dataset.
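The same mask-and-subset pattern on a self-contained toy feature matrix, where columns a and b are perfectly correlated and b is therefore dropped:

```python
import numpy as np
import pandas as pd

# Toy feature matrix: a and b are identical (corr = 1), c is independent
X = pd.DataFrame({
    "a": [1, 0, 1, 0, 1, 0],
    "b": [1, 0, 1, 0, 1, 0],
    "c": [0, 0, 1, 1, 0, 1],
})

corr_matrix = X.corr()
# Drop the later column of any pair correlated at 0.5 or above
columns = np.full((corr_matrix.shape[0],), True, dtype=bool)
for i in range(corr_matrix.shape[0]):
    for j in range(i + 1, corr_matrix.shape[0]):
        if corr_matrix.iloc[i, j] >= 0.5:
            columns[j] = False

X_reduced = X.loc[:, columns]
print(list(X_reduced.columns))  # drops 'b', keeps 'a' and 'c'
```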

In [33]:
# newly selected features
X
Out[33]:
.net core .net framework ad planningand buying ado.net adobe illustrator adobe photoshop affiliate marketing afnetworking agile ajax ... web developer web gl webpack windows windows mobile windows powershell windows presentation foundation (wpf) xamarin xml zurb foundation
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
628 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
629 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
630 0 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
631 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
632 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

633 rows × 278 columns

Recursive Feature Elimination (RFE)

Dimensionality Reduction, Matrix style

In [34]:
# Performing RFE
rForest = RandomForestClassifier()
rfe = RFE(rForest).fit(X, y)
# Rank of 1 indicates the features selected
rfe.ranking_
Out[34]:
array([ 51,   1,   1, 114,   1,   1, 135,  75,   1,   1, 130,  72,   1,
         1,  79,   1,  34,   1,   1,   1,   1,  87,   1,   1,   1,   1,
         1,   1,  58, 108,   1,   1,   1,  20,   1,   1,   1,   1,   1,
         1,   1,  84,   1,  65,  32,   1,   1, 131, 138, 107,   1, 116,
        16,   1,  66, 110,   1,   1, 128,  14,  46,   1,  61, 113,   1,
         7,   1,   1, 106,   1, 115,   6,   1, 124,   1, 104, 109,   1,
       133,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,  36,   1,
         1,   1, 100,   1,  80,  17,  26,   1,  27,  88,   1,  85,   1,
         1,   1,  42,  64,   1,   1,   1,   1,   1,   1,   1,  62,   1,
        22,   1,   4,   1,   1,  63,   1,   1, 129,  92,   1, 140, 132,
        41,  99,   1,   1,  77,  48,  96,   1,   1,  83,   1,   1,  15,
        90,   1,   1,   1,   1, 125,  78,  74,   2,   1,  60,   1,  67,
       139, 123, 121,  40, 122,   1,  35,   1,   1,  95,  69,  57,  12,
       112,   1,  54,  10,   1,   1,   1,  86, 117,  13, 118,   1,   1,
         1,   1,  43,   8,   1,  38,   1,  25,  30,  44,  82,  49, 102,
         1,   1,   1,   1,  29,  97,   1,   1,   1,  81,   1,   1,  50,
         1,  98,  39,   1,   1,  37, 101,   1,  21, 134,   1,   1,   9,
       119,  18,   1,   1, 105,  68,   1,  55,  91,   1,  53,   1,  11,
         5,   1,  47,   1,  31,  45,   3,  52,  94,  59,   1,   1,   1,
        19, 111,   1, 120,   1,   1,   1, 103,  28,   1,  76,  70,   1,
         1, 137, 136,  73,  71, 127,   1,  56,   1,  24,   1,   1,  93,
       126,  23,  33,   1,  89])
In [35]:
from IPython.display import IFrame

IFrame(
    "https://giphy.com/embed/AOSwwqVjNZlDO",
    width="100%",
    height="600",
    scrolling="no",
    frameborder="0",
)
Out[35]:

Optimizing Recursive Feature Elimination (RFE):
The number of features returning the best balanced_accuracy score is 130. This is the number of features at which the random forest algorithm performs best!

RFE chooses the features of the data that are most relevant to the model performing well. Doing so makes the data input to the model less noisy, lets the model train faster, simplifies the model for interpretation, and improves generalization to outside data.
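The tuning step described above, scoring each candidate feature count with balanced_accuracy, can be sketched with scikit-learn's RFECV. The dataset shape and hyperparameters here are illustrative stand-ins, not the project's:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Toy data standing in for the binarized skills matrix
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# Cross-validated RFE: each feature-subset size is scored with balanced_accuracy,
# and the size with the best score is kept
rfecv = RFECV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    step=1,
    cv=3,
    scoring="balanced_accuracy",
).fit(X_demo, y_demo)

print("optimal number of features:", rfecv.n_features_)
```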

Use of K-Means methods to find patterns in the data

Use of the elbow method reveals the optimal number of clusters by which to cluster the data:

In [44]:
from yellowbrick.cluster import KElbowVisualizer

# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(1, 130))

visualizer.fit(X_T)  # Fit the data to the visualizer
visualizer.show()
Out[44]:
<AxesSubplot:title={'center':'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>

DBSCAN (Density-based spatial clustering of applications with noise)

is an unsupervised machine learning algorithm in its own right, but it is often used as an Exploratory Data Analysis technique because it reveals natural clusters in the data. The graph above has grouped the remaining 130 features (or skills) into 16 clusters, which will be used below to find features of importance.

  • DBSCAN does not require that the number of clusters be specified a priori (beforehand), handles outliers well, and works quite well with arbitrarily shaped and sized clusters.

The silhouette score ranges from −1 to +1, with a high value indicating that an object is well matched to its own cluster and poorly matched to neighboring clusters. A value closer to 0, like ours here, indicates the clusters aren't well separated and share similarity with other clusters. This matches what we see with feature variance: only a few features are responsible for a high amount of variance in the predictive analysis.

In [45]:
kmeans = KMeans(n_clusters=34).fit(X_130)
normalized_vectors = preprocessing.normalize(X_130)
normalized_kmeans = KMeans(n_clusters=34).fit(normalized_vectors)
min_samples = X_130.shape[1] + 1
dbscan = DBSCAN(eps=3.5, min_samples=min_samples).fit(X_130)
# Print results
print(
    "kmeans: {0:10.4f}".format(
        silhouette_score(X_130, kmeans.labels_, metric="euclidean")
    )
)
print(
    "Cosine kmeans:{0:10.4f}".format(
        silhouette_score(
            normalized_vectors, normalized_kmeans.labels_, metric="cosine"
        ),
    )
)
kmeans:     0.0815
Cosine kmeans:    0.1221

The DBSCAN algorithm can be used to find associations and structures in the data that are hard to find manually but that are relevant and useful for finding patterns and predicting trends. Below, we select the ten (10) variables with the highest variance; a high variance indicates that a variable is likely to play an important role in creating the clusters.

In [113]:
results = pd.DataFrame(columns=["Variable", "Var"])
for column in df_mean.columns[1:]:
    results.loc[len(results), :] = [column, np.var(df_mean[column])]
selected_columns = (
    list(
        results.sort_values(
            "Var",
            ascending=False,
        )
        .head(10)
        .Variable.values
    )
    + ["dbscan"]
)
tidy = df_scaled[selected_columns].melt(id_vars="dbscan")
sns.barplot(x="dbscan", y="value", hue="variable", data=tidy)
Out[113]:
<AxesSubplot:xlabel='dbscan', ylabel='value'>
In [114]:
for i in (66, *range(89, 97)):
    print(X_130.columns.to_numpy()[i])
go
mongodb
ms sql server
mysql
node.js
object-oriented programming (oop)
opencart
opencv
oracle

Combining Clustering & Correlation techniques to visualize patterns amongst 75 features

In [53]:
# Set y_ticks
y_axis_labels = list(df.index)
# Normalize the data
df_norm_col = (df - df.mean()) / df.std()
# Plot the data
ax = sns.clustermap(df_norm_col, yticklabels=y_axis_labels)

Factor Analysis

In [55]:
X_factor = factor_data.iloc[:, :]
y_factor = factor_data.index
In [56]:
# Using Kaiser-Meyer-Olkin (KMO) value to validate adequate sample size
kmo_all, kmo_model = calculate_kmo(X_factor)
print(format(kmo_model,".4f"))
0.6660

In order to conduct exploratory factor analysis, the Kaiser-Meyer-Olkin (KMO) value must be greater than 0.5, which shows that the sample size is fit for factor analysis. The closer the value is to 1, the better. Although the calculated KMO value of 0.666 isn't ideal, it indicates that factor analysis is viable. This value could be improved with a bigger sample size, i.e. by collecting more data.

An eigenvalue greater than 1 returned by factor analysis is the standard criterion for retaining a factor. This is because the eigenvalue represents the amount of variance explained by that factor; an eigenvalue of 4.04, for example, accounts for as much variance as 4.04 observed variables.
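The eigenvalues behind this criterion (and behind the scree plot below) are those of the correlation matrix of the observed variables, so the idea can be sketched with plain NumPy. Toy random data stands in for the real feature matrix here:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy standardized data: 100 samples, 6 observed variables
X_demo = rng.normal(size=(100, 6))

# Eigenvalues of the correlation matrix, largest first;
# the Kaiser criterion retains factors with eigenvalue > 1
corr = np.corrcoef(X_demo, rowvar=False)
ev = np.linalg.eigvalsh(corr)[::-1]
n_factors = int((ev > 1).sum())
print(ev.round(2), "-> retain", n_factors, "factors")
```

Note that the eigenvalues always sum to the number of variables (the trace of the correlation matrix), which is why "an eigenvalue of 4.04 accounts for 4.04 variables' worth of variance".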

In [60]:
# Create scree plot using matplotlib
plt.scatter(range(1, X_factor.shape[1] + 1), ev)
plt.plot(range(1, X_factor.shape[1] + 1), ev)
plt.title("Scree Plot")
plt.xlabel("Factors")
plt.ylabel("Eigenvalue")
plt.grid()
plt.show()

The newly created latent factors, i.e. features created from the observable features:

In [62]:
Factor_1 = []
Factor_2 = []
Factor_3 = []
for idx, row in fa_loadings.iterrows():
    if row.Fact1_loadings_ > 0.1:
        Factor_1.append(row.Original_Factors)
    elif row.Fact2_loadings_ > 0.1:
        Factor_2.append(row.Original_Factors)
    elif row.Fact3_loadings_ > 0.1:
        Factor_3.append(row.Original_Factors)
print("Factor_1: ", Factor_1)
print("\nFactor_2: ", Factor_2)
print("\nFactor_3: ", Factor_3)
Factor_1:  ['agile', 'android', 'angular js', 'architecture', 'azure', 'brand design', 'chrome apps', 'cordova', 'd3', 'django', 'drupal', 'flask', 'go', 'grunt', 'java', 'linux', 'oracle']

Factor_2:  ['adobe photoshop', 'algorithms', 'amazon web services (aws)', 'aws lambda', 'database modeling', 'eclipse', 'front-end', 'full-stack', 'functional programming', 'jenkins', 'json', 'macos', 'mapreduce', 'matlab', 'node.js', 'object-oriented programming (oop)', 'pandas', 'postgresql', 'python 3', 'rest', 'ruby on rails (ror)', 'scikit-learn', 'test-driven development (tdd)', 'testng']

Factor_3:  ['apache spark', 'asp.net', 'cassandra', 'data warehouse', 'database design', 'etl', 'microsoft sql server', 't-sql', 'tableau', 'tensorflow']

How can the viability of the returned factors be measured? With Cronbach's alpha, which determines whether or not the variables of a factor form a "coherent" and reliable factor. In practice, an alpha above 0.6 is deemed acceptable. Here Cronbach's alpha is calculated using the pingouin package:

In [64]:
# Get cronbach alpha
factor1_alpha = pg.cronbach_alpha(factor1)
factor2_alpha = pg.cronbach_alpha(factor2)
factor3_alpha = pg.cronbach_alpha(factor3)

print(
    "factor1_alpha: ",
    factor1_alpha,
    "\nfactor2_alpha: ",
    factor2_alpha,
    "\nfactor3_alpha: ",
    factor3_alpha,
)
factor1_alpha:  (0.8049658949086095, array([0.782, 0.826])) 
factor2_alpha:  (0.4367866900792546, array([0.371, 0.498])) 
factor3_alpha:  (0.47037097717016946, array([0.407, 0.53 ]))

Logistic Regression

Correlation Amongst All Skills

In [66]:
fig, ax = plt.subplots(figsize=(16, 16))

data_corr = data.corr()
cmap = sns.mpl_palette("Set3", 4)

# Generate a Heatmap
sns.heatmap(data_corr, cbar_kws={"shrink": 0.8}, square=True, cmap=cmap)

# title
title = "Correlation Between All Skills"
plt.title(title, loc="center", fontsize=24)
Out[66]:
Text(0.5, 1.0, 'Correlation Between All Skills')

Visualizing Class Balance

It's important that classes be balanced: with strong class imbalance, a model can seem to perform well simply by predicting the majority class for every sample, which is not indicative of overall model accuracy.

In [69]:
# Looking at Class Imbalance
sns.countplot(x="Frontend", data=data, palette="hls")
print(
    "Percentage of frontend developers is: ",
    "{:.2f}".format(len(data[data["profile"] == "Frontend"]) / len(data) * 100),
    "%",
)
print(
    "Percentage of non-frontend developers: ",
    "{:.2f}".format(
        (len(data) - len(data[data["profile"] == "Frontend"])) / len(data) * 100
    ),
    "%",
)
plt.show()
Percentage of frontend developers is:  45.94 %
Percentage of non-frontend developers:  54.06 %

Fitting the Logistic Regression Model to our Data

In [71]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
Out[71]:
LogisticRegression()

Measuring Model Performance

  • Accuracy
  • The Confusion Matrix
  • The Classification Report
  • ROC_AUC_score
  • ROC Curve

Scoring Metrics

Accuracy can be a useful measure when the same number of samples exists per class in a dataset. But with an imbalanced set of samples, accuracy is biased toward the majority class. A more robust scoring metric is the F1 score, used along with cross_val_score to score the model across multiple test runs.

In [72]:
# Generating predictions by which to measure accuracy of the fitted model
y_pred = logreg.predict(X_test)
print(
    "The f1 score of the logistic regression classifier on test set: ",
    "{:.3f}".format(
        np.mean(cross_val_score(logreg, X_test, y_test, scoring="f1", cv=5))
    ),
)
The f1 score of the logistic regression classifier on test set:  0.834

The Confusion Matrix

This is the basis from which all other metrics are calculated. It provides a simple, but clear, picture which shows how the model is performing.

In [73]:
# Confusion Matrix
cm = metrics.confusion_matrix(y_test, y_pred)
group_names = ["True Negative", "False Positive", "False Negative", "True Positive"]
group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
labels = [
    f"{v1}\n{v2}\n{v3}"
    for v1, v2, v3 in zip(group_names, group_counts, group_percentages)
]
labels = np.asarray(labels).reshape(2, 2)
sns.heatmap(cm, annot=labels, fmt="")
Out[73]:
<AxesSubplot:>

The ROC Curve

This is a commonly used curve that summarizes the performance of a classifier over all possible thresholds. It's generated by plotting the True Positive Rate (y-axis) against the False Positive Rate (x-axis) as the threshold for assigning observations to the positive class is varied.
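The false_positive_rate and true_positive_rate arrays plotted below can be produced with sklearn's roc_curve; the labels and scores here are toy stand-ins for the test set and the model's predicted probabilities:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy true labels and predicted scores
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6])

# roc_curve sweeps the decision threshold and returns one (FPR, TPR) point each
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_true, y_score)
```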

In [75]:
plt.subplots(1, figsize=(10, 10))
plt.title("Receiver Operating Characteristic (ROC) : Logistic regression")
plt.plot(false_positive_rate, true_positive_rate)
plt.plot([0, 1], ls="--")
plt.plot([0, 0], [1, 0], c=".7"), plt.plot([1, 1], c=".7")
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.show()

ROC_AUC Score

To return a single number summarizing how good the ROC curve is, calculate the Area Under the ROC Curve, or the ROC_AUC score. This tells us how good the model is at ranking predictions.

In [76]:
# The Area Under the Curve (AUC) is the measure of the ability of a classifier to distinguish between classes
print(
    "The roc_auc_score for the Logistic Regression classifier: ",
    "{:.2f}".format(roc_auc_score(y_test, y_score)),
)
The roc_auc_score for the Logistic Regression classifier:  0.96

Logistic Regression with PySpark

In [80]:
sdf = sqlCtx.createDataFrame(data)

Transforming the Data to Spark

In [81]:
# Build a dict (original column name: new column name)
mapping = {col: col.replace(".", "_").replace(" ", "_") for col in selected_columns}
In [82]:
# Select all columns and create an alias if there is a mapping for this column
df_renamed = sdf.select(
    [F.col("`" + c + "`").alias(mapping.get(c, c)) for c in sdf.columns]
)
In [83]:
# Create a VectorAssembler obj using the renamed columns as input
assembler = VectorAssembler(inputCols=list(mapping.values()), outputCol="features")
In [84]:
transformed_data = assembler.transform(df_renamed)
In [85]:
# Rename "Fullstack" column to "label"
data = transformed_data.withColumnRenamed("Fullstack", "label")

Handling Class Imbalance

In our training set we have 36.04% negatives and 63.96% positives. Since positives are the majority, the logistic loss objective function should treat the negative class (Outcome == 0) with a higher weight. For this purpose we calculate the BalancingRatio as follows:

For every Outcome == 0, the calculated BalancingRatio is assigned to the corresponding row in the "classWeights" column; likewise, for every Outcome == 1, 1 - BalancingRatio is assigned.
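A plain-Python sketch of this weighting scheme (the project computes it on the Spark DataFrame); the toy label list stands in for the train set's label column:

```python
# BalancingRatio = share of positives; each class is weighted by the
# other class's share, so the minority class counts for more in the loss
labels = [1, 1, 1, 0, 1, 0, 1, 0, 1, 1]  # toy stand-in, 70% positives

num_positives = sum(labels)
balancing_ratio = num_positives / len(labels)

class_weights = [
    balancing_ratio if label == 0 else 1 - balancing_ratio for label in labels
]
```

With 70% positives, each negative receives weight 0.7 and each positive 0.3, mirroring the F.when(...).otherwise(...) expression below.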

In [87]:
data = data.withColumn(
    "classWeights",
    F.when(data.label == 0, BalancingRatio).otherwise(1 - BalancingRatio),
)

Create PySpark Logistic Regression Model

In [89]:
from pyspark.ml.classification import LogisticRegression

# Create a logistic regression object
lr = LogisticRegression(
    featuresCol="features", labelCol="label", weightCol="classWeights", maxIter=100
)
lrModel = lr.fit(train)
trainingSummary = lrModel.summary

Visualize Model Performance

In [91]:
# Plot the recall-precision curve
pr = trainingSummary.pr.toPandas()
plt.plot(pr["precision"], pr["recall"])
plt.xlabel("Precision / 1 - FPR")
plt.ylabel("Recall / TPR")
plt.show()
In [92]:
# Plot the threshold-precision curve
tp = trainingSummary.precisionByThreshold.toPandas()
plt.plot(tp["threshold"], tp["precision"])
plt.xlabel("Threshold")
plt.ylabel("Precision")
plt.show()

Binary Decision Tree

In [99]:
# Generating the DecisionTree model
classifier = DecisionTreeClassifier()
# Fitting the model
classifier.fit(X_train, y_train)
Out[99]:
DecisionTreeClassifier()

The Classification Report

While the classification report generated below reveals this Binary Decision Tree model to score an accuracy rate of 84%, a Multinomial (or Multi-class) model trained on exactly the same data and with the same model specifications scores an accuracy rate of 68% in predicting the five (5) classes present.

In [100]:
# Making predictions with the model
y_pred = classifier.predict(X_test)
# Generating the classification report
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.87      0.84      0.85       140
           1       0.81      0.85      0.83       116

    accuracy                           0.84       256
   macro avg       0.84      0.84      0.84       256
weighted avg       0.85      0.84      0.84       256

The ROC Curve

In [101]:
from yellowbrick.classifier import ROCAUC

# Instantiating the visualizer with the Decision Tree model
visualizer = ROCAUC(classifier, classes=["Frontend", "Non-Frontend"])

visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
visualizer.show()
Out[101]:
<AxesSubplot:title={'center':'ROC Curves for DecisionTreeClassifier'}, xlabel='False Positive Rate', ylabel='True Positive Rate'>

Visualizing the Decision Tree's Structure

The tree will stabilize at a max_depth of 125 on its own, but for the sake of this visual, the max_depth has been set to 25.

In [103]:
clf = DecisionTreeClassifier(
    max_depth=25
)  # max_depth is maximum number of levels in the tree
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(14, 14), dpi=600)
clf = clf.fit(X, y)
plot_tree(clf, filled=True, feature_names=feature_names, class_names=class_names)
fig.set_size_inches(3, 6)

Multinomial Random Forest

Random Forests perform well on data of high dimensionality. Furthermore, the algorithm handles outliers and imbalanced data well.

Fit the Random Forest Model to our Data

In [109]:
# Generate the RandomForest model
forest = RandomForestClassifier(class_weight="balanced", random_state=0)
# Fit the model
forest.fit(X_train, y_train)
Out[109]:
RandomForestClassifier(class_weight='balanced', random_state=0)

The Classification Report

  • Precision — What percent of your predictions were correct?
  • Recall — What percent of the positive cases did you catch?
  • F1 score — The harmonic mean of precision and recall, balancing the two.
  • Support — The number of actual occurrences of the class in the specified dataset.
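The relationship between these metrics and the confusion-matrix counts can be spelled out with toy counts for true positives, false positives, and false negatives:

```python
# Toy confusion-matrix counts for one class
tp, fp, fn = 40, 10, 20

precision = tp / (tp + fp)          # 40 / 50 = 0.8
recall = tp / (tp + fn)             # 40 / 60 ~= 0.667
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean ~= 0.727
```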
In [111]:
from yellowbrick.classifier import ClassificationReport

visualizer = ClassificationReport(forest, classes=classes, support=True)

visualizer.fit(X_train, y_train)  # Fit the visualizer and the model
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
visualizer.show()
Out[111]:
<AxesSubplot:title={'center':'RandomForestClassifier Classification Report'}>

The Multi-Class ROC Curve

In [112]:
from yellowbrick.classifier import ROCAUC

# Instantiating the visualizer with the Random Forest model
visualizer = ROCAUC(forest, classes=classes)

visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
visualizer.show()
Out[112]:
<AxesSubplot:title={'center':'ROC Curves for RandomForestClassifier'}, xlabel='False Positive Rate', ylabel='True Positive Rate'>

Conclusion


The original data's format required some creative re-formatting in order to be processed correctly by our Machine Learning algorithms. But ultimately, we were able to implement a binary classification problem with Logistic Regression and generate viable results with a Multi-Class Random Forest model. Although these algorithms are powerful in their ability to make sense of data, it's clear that the creativity and insight which only people can provide will continue to be fundamental to the process of Machine Learning.

Future Work


In the future, I hope to investigate the information supplied by the predict_proba method in more depth. This method shows the probability that an observation belongs to a certain class. It sheds light on which samples the algorithm classifies with less certainty versus those classified with more certainty. A closer look at the skills of those profiles may lend insight as to which skills are determinative in predicting a profile type.
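As a minimal sketch of what that investigation might start from (toy data in place of the skills matrix; probabilities near 0.5 mark the least certain classifications):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy two-feature, two-class data standing in for the binarized skills
X_demo = np.array([[0, 1], [1, 0], [1, 1], [0, 0], [1, 0], [0, 1]])
y_demo = np.array([1, 0, 1, 0, 0, 1])

clf = LogisticRegression().fit(X_demo, y_demo)

# One probability per class, per observation; each row sums to 1
proba = clf.predict_proba(X_demo)
```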